Remove remaining uses of FFI under -fpure-haskell #660
Conversation
All of these were standard C functions that GHC's JS backend actually somewhat supports; their shims can be found in the compiler source at "rts/js/mem.js". But it seems simpler to just get rid of all FFI uses with -fpure-haskell rather than try to keep track of which functions GHC supports.

The pure Haskell implementation of memcmp runs about 6-7x as fast as the simple one-byte-at-a-time implementation for long equal buffers, which makes it... about the same speed as the pre-existing shim, even though the latter is also a one-byte-at-a-time implementation!

Apparently GHC's JS backend is not yet able to produce efficient code for tight loops like these; the biggest problem is that it does not perform any loopification, so each iteration must go through a generic-call indirection. Unfortunately that means that this patch probably makes 'strlen' and 'memchr' much slower with the JS backend.
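To make the "generic-call indirection per iteration" point concrete, here is a rough sketch in plain JavaScript of the two compilation strategies for a self-recursive byte comparison. This is hypothetical illustration code, not actual GHC JS backend output; the `genericCall` helper is an invented stand-in for the backend's generic dispatch machinery.

```javascript
// Naive strategy: every "iteration" of the recursion goes through a
// generic dispatch helper, standing in for the generic-call
// indirection described above. (In a real RTS this helper would
// check arity, build frames, etc.)
function genericCall(fn, args) {
  return fn.apply(null, args);
}

function memcmpGeneric(a, b, i, n) {
  if (i >= n) return 0;
  const d = a[i] - b[i];
  if (d !== 0) return d;
  return genericCall(memcmpGeneric, [a, b, i + 1, n]); // one call per byte
}

// Loopified strategy: the self tail call is compiled to a while loop,
// with no per-iteration call or allocation overhead at all.
function memcmpLoop(a, b, i, n) {
  while (i < n) {
    const d = a[i] - b[i];
    if (d !== 0) return d;
    i += 1;
  }
  return 0;
}
```

Both follow the C memcmp convention (negative/zero/positive result); the difference is purely in how the loop itself is driven, which is where the per-iteration overhead discussed here comes from.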
cc @hsyl20
@luite told me it would be a big performance hit for JS, or that we would need to figure out how to optimize recursive functions like these first. As an alternative we could perhaps define GHC primops for common libc operations? That would make
Or perhaps something provided by
Alternatively, if GHC performed loopification via join points: https://gitlab.haskell.org/ghc/ghc/-/issues/14068, then we would get it for free.
We definitely should have a primop for

There actually is a version of
It would be relatively trivial to write an
It looks like the tail calls of join points get turned into trampolines, which is better than the status quo for non-join-pointed tail calls, but still not as good as we'd like for basic self-loops like these, which should be very efficiently implementable by wrapping them in

Also, we should be able to just trampoline for known exactly-saturated function calls in tail position instead of producing generic-call code, even if we are not calling a join point. Right?

In any case, my inclination is to just accept these performance regressions for now.
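As a sketch of the trade-off being described (helper names are mine, and this is an illustration of the general technique, not GHC's actual code generation): a trampolined self tail call bounds stack usage but still pays one closure allocation and one call per iteration, whereas a direct while loop pays nothing per iteration.

```javascript
// Trampoline driver: keeps forcing thunks until a non-function
// value comes back.
function trampoline(step) {
  let r = step;
  while (typeof r === "function") r = r();
  return r;
}

// Trampolined strlen over a zero-terminated byte buffer: instead of
// calling itself in tail position, each step returns a thunk.
function strlenStep(buf, i) {
  if (buf[i] === 0) return i;
  return () => strlenStep(buf, i + 1); // one allocation per iteration
}

function strlenTrampoline(buf) {
  return trampoline(strlenStep(buf, 0));
}

// Direct loop: what compiling the self-loop straight to a while
// loop would give us, with no per-iteration allocation at all.
function strlenLoop(buf) {
  let i = 0;
  while (buf[i] !== 0) i += 1;
  return i;
}
```

Both compute the same result; the point is that the trampoline is a general mechanism for arbitrary tail calls, while a known self-loop admits the much cheaper direct-loop form.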
opened: https://gitlab.haskell.org/ghc/ghc/-/issues/24442
sounds good to me.
(cherry picked from commit 305604c)
(I noticed this situation while working on #569.)
(This is based on top of #659 to avoid pointless CPP.)